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Abstract 

Background: Enormous molecular sequence data have been accumulated over the past several years and are still 
exponentially growing with the use of faster and cheaper sequencing techniques. There is high and widespread 
interest in using these data for phylogenetic analyses. However, the amount of data that one can retrieve from 
public sequence repositories is virtually impossible to tame without dedicated software that automates processes. 
Here we present a novel bioinformatics pipeline for downloading, formatting, filtering and analyzing public 
sequence data deposited in GenBank. It combines some well-established programs with numerous newly 
developed software tools (available at http://software.zfmk.de/). 

Results: We used the bioinformatics pipeline to investigate the phylogeny of the megadiverse insect order 
Hymenoptera (sawflies, bees, wasps and ants) by retrieving and processing more than 120,000 sequences and by 
selecting subsets under the criteria of compositional homogeneity and defined levels of density and overlap. Tree 
reconstruction was done with a partitioned maximum likelihood analysis from a supermatrix with more than 
80,000 sites and more than 1,100 species. In the inferred tree, consistent with previous studies, "Symphyta" is 
paraphyletic. Within Apocrita, our analysis suggests a topology of Stephanoidea + (Ichneumonoidea + 
(Proctotrupomorpha + (Evanioidea + Aculeata))). Despite the huge amount of data, we identified several persistent 
problems in the Hymenoptera tree. Data coverage is still extremely low, and additional data have to be collected 
to reliably infer the phylogeny of Hymenoptera. 

Conclusions: While we applied our bioinformatics pipeline to Hymenoptera, we designed the approach to be as 
general as possible. With this pipeline, it is possible to produce phylogenetic trees for any taxonomic group and to 
monitor new data and tree robustness in a taxon of interest. It therefore has great potential to meet the 
challenges of the phylogenomic era and to deepen our understanding of the tree of life. 



Background 

Reconstructing the phylogeny of organisms, the tree of 
life, is one of the major goals in biology and is essential 
for research in other biological disciplines ranging from 
evolutionary biology and systematics to biological control 
and conservation. In phylogenetics, molecular characters 
have become an indispensable tool, since they can be col- 
lected in a standardized and automated way. This is indi- 
cated by the exponential growth of published data, with a 
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current doubling time of approximately 30 months [1] 
and expected massively accelerated data generation over 
the next several years. The sequencing of expressed 
sequence tags (ESTs), complete genomes and countless 
single-gene fragments has resulted in enormous, yet 
highly incomplete and unbalanced, data sets accessible 
via public databases such as the National Center for Bio- 
technology Information (NCBI) GenBank, the European 
Molecular Biology Laboratory (EMBL) and the DNA 
Database of Japan (DDBJ). 

The accumulation of new data is, of course, important, 
but the potential of the currently available data for phy- 
logenetic analysis has not yet been sufficiently explored. 
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McMahon and Sanderson [2], Sanderson et al. [3] and 
Thomson and Shaffer [4] have published their attempts 
to use molecular data from public databases and to pro- 
cess them for phylogenetic analysis. However, these 
approaches, while valuable and trend-setting, did not 
offer thorough solutions and call for extension, improve- 
ments and updates in terms of generalization, detail, 
analysis and degree of automation. Any new approach 
must offer solutions to deal with data scarcity, poor data 
overlap, nonstationary substitution processes, base com- 
positional heterogeneity and data quality deficits. In this 
study, we address these problems with a newly devel- 
oped bioinformatics pipeline. We use a large exemplar 
taxon for which far more than 100,000 sequences have 
been published and show that comprehensive analyses 
can potentially deliver new results which were not avail- 
able from each included data set separately. 

As an exemplary taxon, we chose the insect order 
Hymenoptera, which comprises prominent groups such as 
bees, ants and wasps, the latter including the overwhelm- 
ing armada of parasitoid species [5]. The Hymenoptera 
seem well-suited to demonstrate the power of our 
approach, since the taxon is megadiverse and offers a 
number of phylogenetic challenges, including many unre- 
solved relationships and well-known problems that are 
associated with so-called long-branch taxa and rapid radia- 
tions (see, for example, [6-8]). Over a long period, com- 
paratively few authors tried to resolve the phylogenetic 
relationships of the major lineages of Hymenoptera (see, 
for example, [9-16]). In recent years, however, interest and 
effort in solving higher-level relationships within the 
Hymenoptera have notably increased and led to the publi- 
cation of an extensive analysis based exclusively on mor- 
phological characters [17], a study using complete 
mitochondrial genomes [18], a supertree approach using 
previously published trees [19], a phylogenetic estimate 
based on EST data [20] and a taxon-rich four-gene study 
[21]. In the past five years, complete nuclear genomes of 
several Hymenoptera species have been sequenced. Most 
noteworthy in this context are the genomes of the honey 
bee Apis mellifera [22] and the jewel wasp Nasonia vitri- 
pennis, with its sibling species N. giraulti and N. longicor- 
nis [23] . These genomes contributed significantly to the 
amount of sequence data available for Hymenoptera. 
However, their number is still too small to profitably aug- 
ment phylogenetic analyses. 

Overall, there are only few phylogenetic hypotheses on 
major lineages within Hymenoptera that are generally 
accepted. These are as follows: (1) "Symphyta" (sawflies) 
are paraphyletic, with the absence of the constriction 
between the first and second abdominal segments (that 
is, the wasp waist) as a symplesiomorphic character; (2) 
Apocrita (wasp-waisted wasps) are monophyletic (see, for 



example, [24]); (3) Xyelidae are sister group to all other 
Hymenoptera (see, for example, [25-27]); (4) Orussidae 
are sister group to Apocrita (see, for example, [17,18,27]) 
and (5) Aculeata (stinging wasps; Apoidea, Chrysidoidea 
and Vespoidea) are monophyletic (see, for example, [28]). 
In addition, most of the 22 currently recognized superfa- 
milies are presumed to be monophyletic (see [29] for a 
synopsis). Numerous relationships within Hymenoptera 
are still unresolved. Among them, the most intriguing 
ones are the phylogeny of the major lineages within 
Apocrita, and in particular what the sister group of Acu- 
leata is, and the monophyly and phylogeny of Proctotru- 
pomorpha sensu Rasnitsyn 1988 [13] (Chalcidoidea, 
Cynipoidea, Diaprioidea, Mymarommatoidea, Platygas- 
troidea and Proctotrupoidea). 

In this study, we present a standardized, fast and trans- 
parent bioinformatics pipeline to collect, filter and analyze 
public sequence data deposited in GenBank. The pipeline 
is designed to be generally applicable in terms of taxa, 
genes and the variety of potential users. We apply this 
pipeline to sequences of Hymenoptera and discuss our 
results against the background of current hypotheses on 
two selected questions: the phylogeny of the major lineages 
within Apocrita and the monophyly and phylogeny of 
Proctotrupomorpha. Additionally, we use the results to 
diagnose persistent problems in the hymenopteran tree. 
Finally, we illustrate the merit of being able to easily gener- 
ate trees from available sequence data at a time when data 
sets are accumulating at an ever-increasing speed. 

Methods 

We developed a bioinformatics pipeline that includes 
automated data retrieval, processing, filtering and analy- 
sis of sequence data using available programs in combi- 
nation with newly developed scripts. The individual steps 
of the pipeline are illustrated in Figure 1. Those steps 
that are executed by new scripts are highlighted in blue. 
These scripts can be downloaded from http://software. 
zfmk.de/ or accessed as part of Additional file 1. They 
have been written in the Ruby or Perl programming lan- 
guage and will run on any Linux operating system. Each 
of our scripts comes with a manual that provides more 
detailed information on what it does and how to use it 
(manuals are available at http://software.zfmk.de/ and 
also are located in Additional file 1). Table 1 summarizes 
all new scripts and their respective tasks. To maneuver 
through the pipeline, each script has to be manually 
started with the output from the preceding step. This 
allows the user to manually interfere at each step or to 
modify the pipeline to adapt it to new demands. In the 
following paragraphs, we explain the individual steps of 
the pipeline using the example of the analysis of Hyme- 
noptera sequences deposited in GenBank. 
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- Download from GenBank [I] 



Nuclear seqs 

I 

Standardize headers [a. I] 

I 

Assembly of coding seqs (CAP3) [a. II] 

I 

Search for orthologs (HaMStR) [a. Ill] 
Choose longest seq per species and gene [a. IV] 



1 



L. 



Mitochondrial seqs + nuclear non-coding seqs 

I 

Standardize headers [b.l] 

I 

Split sequences to single genes [b.l!] 

I 

Check strand polarity and sequence similarity in blast2seq [b.lll] 

I 

Choose longest seq per species and gene [b.lV] 
Translate coding mt seqs from nt to aa [b.V] 



- Delete groups of orthologs with < 3 species [II] 

I 

Delete species with only 1 seq and 
groups with < 3 species [III] 

I 

Alignment (MAFFT) [IV] 
I 

Refinement of alignment (MUSCLE) [V] 
I 

Backtranslate coding mt seqs from aa to nt [VI] 

I 

Mask alignment ambiguous or highly divergent regions (ALISCORE, Gblocks, gapkiller) [VII] 

I 

Select codon positions 1 and 2 in coding mt genes [VIII] 

I 

Select maximum clique of seqs with > 100nt or > 100aa overlap [IX] 

I 

Select second maximum clique from rest [X] 



,_l 



Ban compositional heterogeneity [XI] 

I 

Ban compositional heterogeneity from rest [XII] 



Delete species with only 1 seq and 
groups with < 3 species [XIII] 

I 

Prune genera to 15 species [XIV] 

I 

Select largest group of species that overlap in > 1 group of orthologs [XV] . 



Concatenate to supermatrix [XVI] 

I 

Partitioned maximum likelihood analysis 
with rapid bootstraps (ProtTest, RAxML) [XVII] 

I 

The tree [XVIII] 



Subset 2: Add 2 species 
from excluded supertaxa 

Concatenate to supermatrix 



ML analysis 

I 

Tree 2 



Figure 1 Outline of our pipeline that processes GenBank sequence data for phylogenetic analysis Steps that are executed by newly 
developed scripts are highlighted in blue, and external programs are written in parentheses after step description. Steps that directly refer to the 
phylogenetic analysis are highlighted in red. The additional procedure to infer subset 2 is shaded in gray. 



Sequence data retrieval and data processing brackets correspond to those in Figure 1). The script 

We downloaded all sequences of Hymenoptera depos- searched for the query taxon (Hymenoptera) in the 

ited in GenBank 172.0 (as of 18 August 2009) with the nucleotide and in the EST database of GenBank (NCBI) 

aid of the script proseqeo [I] (Roman numerals in square and stored the sequences of each species in a separate 
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Table 1 New scripts used in our pipeline 3 



Step 


Number 


Script 


Download from GenBank 


[1] 


pfOSeqCO 


Standardize headers 


[a.l], [b.l] 


header_standardizer 


Split sequences to single genes 


[b.ll] 


multiple sequence splitter 


Check strand polarity and sequence similarity 


[b.lll] 


checking seq 


Choose longest sequence per species and gene 


[a.rV], [b.lV] 


choose_longest_seq 


Translate coding mitochondrial sequences from nucleotides to amino acids 


[b.V] 


dna2aa 


Delete groups of orthologs with three or fewer species 


[II], Oil], [XIII] 


small_groups_deleter 


Delete species with only one sequence 


[III], [XIII] 


taxon_deleter 


Backtranslate coding mitochondrial sequences from amino acids to nucleotides 


[VI] 


aa2dna 


Mask gappy regions in alignment 


[VII] 


gap_killer 


Select maximum clique of overlapping sequences 


[IX], [X] 


minimum_sequence_overlap 


Ban compositional heterogeneity 


[XI], [XII] 


nucleotide_chi 


Prune genera to best represented species 


[XIV] 


prune_genera 


Select largest group of species that overlap in at least one group of orthologs 


[XV] 


reduce2leading_gene 


Concatenate alignments 


[XVI] 


concatenator 



a Available at http://software.zfmk.de/ and in Additional file 1. All scripts were written in Ruby, except for checking_seq, which was written in Perl. Numerals 
(column "Number") correspond to those in Figure 1. 



Fasta file. Mitochondrial sequences plus nuclear non- 
coding sequences (ITS1, ITS2 and nuclear rRNA) (right 
path b) and all other nuclear sequences (left path a) 
were retrieved in two separate downloads. For outgroup 
comparison, we additionally retrieved sequence data of 
the transcriptome, the nuclear noncoding genes and the 
complete mitochondrial genome of Bombyx mori (Lepi- 
doptera), Aedes aegypti (Diptera) and Tribolium casta- 
neum (Coleoptera). The gi numbers of all downloaded 
sequences are listed in Additional file 2. 
Left path a 

The nuclear sequences were assembled into contigs for 
each species using the sequence assembly program CAP3 
[30] [a.II]. Orthologous sequences were identified using 
HaMStR 1.3 [31] [a.III]. We used the Insecta core set 
(available at http://www.deep-phylogeny.org/hamstr/ 
download/datasets/hmmer2/) to build hidden Markov 
models (default settings). The genome of A. mellifera was 
chosen for the reciprocal BLAST search [31]. (If sequences 
of other taxa are processed, a different core set and a dif- 
ferent species for the reciprocal BLAST search will have to 
be selected.) We chose HaMStR as the currently most 
practicable tool to automatically assign orthology among 
nucleotide and EST sequence data. During the HaMStR 
orthology prediction, all nucleotide sequences are trans- 
lated into the corresponding amino acid sequences. 
Right path b 

The mitochondrial sequences and the nuclear noncod- 
ing sequences deposited in GenBank often include 
regions that span more than just one gene. In these 
instances, the script multiple_sequence_splitter uses 
information from the corresponding GenBank file to 
split sequences into fragments that correspond to single 



genes; that is, it creates multiple sequence files of single 
genes [b.II]. This step was serially applied to each file 
that we obtained from the previous step by means of a 
shell script. (See the multiple_sequence_splitter manual 
for a description of how to do this. Any step of the 
pipeline that had to be serially applied to a set of files 
was executed by means of a similar shell script [a.l, a.IV, 
b.l, b.II, b.IV, b.V, IV, V, VI, VII, IX, X, XI and XII].) In 
each of the obtained files, we used the script checking_- 
seq to check for consistent strand polarity and overall 
similarity between sequences [b.III]. This was done to 
revert sequences with deviating strand polarity, to 
exclude wrongly annotated sequences and to ensure that 
all sequences in a single-gene file were orthologous. The 
script checking_seq compares a template of a gene with 
all the sequences of the single-gene files that were cre- 
ated in step [b.II] in blast2seq [32]. The identity (blas- 
t2seq results) between template and target sequence had 
to be more than 15 nucleotides. Otherwise, the reverse 
complement of the target sequence was checked, and 
hits were reverted. If identities were still below the 
match threshold, the target sequences were compared 
with a second, third or fourth template. Primary tem- 
plates were taken from A. mellifera. (If sequences of 
other taxa are processed, other templates will have to be 
selected.) We randomly selected sequences from pre- 
viously successfully checked sequences as subsequent 
templates. A maximum of four templates were used 
before we finally discarded a sequence. Then, to prepare 
the remaining sequences for the subsequent alignment, 
all coding mitochondrial sequences were translated from 
nucleotide to corresponding amino acid sequences with 
the aid of the script dnalaa, which uses the respective 
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GenBank information for this task [b.V]. Steps b.IV and 
b.V of our pipeline are automatically consecutively exe- 
cuted when using the script batchl blVtobV.sh. (See 
manual of batch scripts for details.) 
Both paths 

Sequence headers of all sequences were standardized 
to ">species,family,gi no." with the aid of the script 
header_standardizer, which uses the data included in 
the GenBank entries [a.I and b.I]. If multiple sequences 
were available for a given species and gene after respec- 
tive steps [a.I to a.III] and [b.I to b.III], we chose the 
longest sequence from the unaligned multiple sequence 
files [a.IV and b.IV]. This was done by using the script 
choose_longest_seq. 
Converged paths 

We obtained numerous groups of orthologous sequences 
from path a and path b. Groups of orthologs that com- 
prised three or fewer species were deleted by the script 
small_groups_deleter [II]. To increase data density, we 
discarded all species with only a single sequence in the 
data set by using the script taxon_deleter and again 
deleted groups of orthologs with three or fewer species 
by using small_groups_deleter [III] . 

Multiple sequence alignment and alignment masking 

Orthologous sequences were aligned with MAFFT v6.712b 
using the auto option [IV]. Depending on the size of an 
alignment, MAFFT automatically chooses a suitable align- 
ment option, such as L-INS-i for < 200 sequences and 
FFT-NS-2 for > 2,000 sequences [33,34]. All alignments 
were subsequently refined with the refinement option in 
MUSCLE version 3.7 [35] [V]. These are powerful align- 
ment tools that allow processing very large data sets in 
reasonable time. Steps II through VI of our pipeline are 
automatically consecutively executed when using the 
script batch2_IItoVI.sh. (See the manual of batch scripts 
for details.) Aligned and refined mitochondrial amino acid 
sequences were then translated back into nucleotide 
sequences with the aid of the script aa2dna, which uses 
the corresponding reading frame information from the 
GenBank file [VI]. From this point on, we proceeded with 
nucleotide sequences for all mitochondrial sequences and 
nuclear noncoding sequences, as well as with amino acid 
sequences for the nuclear coding sequences (available 
since step [a.III]). 

Ambiguously aligned or highly diverged regions of the 
alignment were masked with three different algorithms 
[VII]. We applied ALISCORE [36,37] and ALICUT [38] 
for noncoding nucleotide sequences and for nuclear 
amino acid sequences (default settings). Since the multi- 
ple sequence alignment of 28S rRNA was too big to be 
processed with ALISCORE, we used Gblocks 0.91b 
[39,40] for 28S instead (block parameter settings: (1) 
number of included seq/2 = 1020, (2) 1020, (3) 5, (4) 10, 



and (5) all). Finally, we used the script gapkiller to iden- 
tify and delete sites with more than 70% gaps in coding 
mitochondrial sequences. Then we masked all third 
codon positions of mitochondrial coding sequences 
[VIII] and concatenated all tRNA alignments to one sin- 
gle alignment. 

Species and sequence subset selection 

In each group of orthologous sequences, we selected the 
largest group of species in which the sequences of all spe- 
cies overlap in at least 100 nucleotide or amino acid posi- 
tions [IX]. This was done with the aid of the script 
minimum_sequence_overlap. The script applies a maxi- 
mum clique algorithm. Generally, a maximum clique 
search is a way to find the largest group of items that fulfill 
a certain pairwise criterion. (See Additional file 3 for a 
short introduction to maximum cliques.) This approach is 
the formal solution to guarantee that our overlap criterion 
is fulfilled. Species that were not included in this first max- 
imum clique were considered again in a search for a sec- 
ond maximum clique using the same criteria and the same 
script as before [X]. So, for each gene, we retained two 
separate files with groups of orthologous sequences: the 
first and the second maximum clique, respectively. 
Sequences that were not included in either of the maxi- 
mum cliques were discarded. 

To identify sequences that showed compositional het- 
erogeneity in each group of orthologous nucleotide 
sequences, we used the script nucleotide_chi. The script 
applies a % 2 test (test procedure identical to the j 2 test 
implemented in TREE-PUZZLE [41]) and proceeds with 
excluding sequences with a base composition that signifi- 
cantly deviates until all sequences show compositional 
homogeneity [XI]. Since excluded sequences could com- 
prise another set of homogeneous sequences, they were 
again tested with the same procedure as before to obtain a 
second group of sequences with compositional homogene- 
ity [XII] . Sequences that did not end up in either of the 
two groups with compositional homogeneity were dis- 
carded. After discarding numerous sequences in steps IX 
through XII, we again excluded species with only one 
sequence in the data set by using the script taxon_deleter 
and groups of orthologs with three or fewer species by 
using the script small_groups_deleter [XIII]. Next, we 
pruned species-rich genera to the 15 species that were 
best represented in the data set by using the script prune_- 
genera. The representation criteria were, in this order, (1) 
the number of sequences in the data set and (2) the overall 
length of the sequence in the data set [XIV] . 

In a final subset selection step, we ensured that all 
species to be included in this subset overlap in at least 
one gene fragment of at least 100 nucleotide or amino 
acid positions [XV]. With the aid of the script reduce2- 
leading_gene, we pruned the data set to those species 
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that were present in the most sequence-rich group of 
orthologs. This was the largest group of species that ful- 
filled the overlap criterion. In case of Hymenoptera, this 
group was a group of COX1 sequences. All correspond- 
ing sequences were concatenated with the script conca- 
tenator to one supermatrix. This supermatrix is referred 
to as "subset 1" [XVI]. Steps IX through XVI of our 
pipeline are automatically consecutively executed when 
using the script batch3_IXtoXVI.sh. (See manual of 
batch scripts for details.) In addition to subset 1, we 
selected a second subset. To accomplish this, we made 
concessions to systematic considerations and added to 
subset 1 representatives of Hymenoptera families that 
were excluded by any of the previous filtering steps. If 
more than two species of the respective families were 
available, we selected the two best-represented species 
using criteria identical to those in step [XIV]. With 
those sequences reincluded in the respective groups of 
orthologs, the tests for compositional heterogeneity (as 
described in step [XI]) were repeated and all sequences 
were finally concatenated to a supermatrix. This super- 
matrix is referred to as "subset 2." 

Tree reconstruction 

Phylogenetic inference of subset 1 and of subset 2 was 
done under the maximum likelihood (ML) optimality cri- 
terion in partitioned analyses with RAxML 7.2.8 [42,43] 
under the GTRCAT model. Analyses were computed on 
HPC Linux clusters, 8 nodes with 12 cores each, at the 
Regionales Rechenzentrum Koln (RRZK) using Cologne 
High Efficient Operating Platform for Science (CHEOPS); 
input was done in phylip format; and conversion of Fasta 
to phylip was done using Readseq [44] [XVII]. Nuclear 
coding genes were treated as one partition (PROTCAT 
model, substitution matrix LG + F, taken from ProtTest 
[45]). All other groups of orthologs were treated as sepa- 
rate partitions (32 partitions in total). (See Additional file 
4 for the character partitions of subset 1 and 2.) We 
applied the rapid bootstrap algorithm [46] with a subse- 
quent tree search. The numbers of bootstrap replicates 
were estimated on the fly by the "bootstopping" criteria 
implemented in RAxML 7.2.8 (default settings) [47]. The 
analyses yielded two trees. These trees are referred to as 
"tree 1" (corresponding to subset 1) and "tree 2" (corre- 
sponding to subset 2). Trees were edited in Dendroscope 
[48] [XVIII]. 

Hymenoptera systematics 

We follow the terminology of [29] for supraspecific taxa 
of Hymenoptera. 

Results 

We downloaded 122,723 Hymenoptera sequences from 
GenBank 172.0 (as of 18 August 2009), including those 



of the nuclear genome of N. vitripennis (9,254 contigs). 
The annotation of the nuclear genome of A. mellifera 
was used as a reference when searching for orthologs 
(see Methods, step [a.III]), and corresponding sequences 
of this species were added during this step. After the first 
processing steps [a.I/b.I to II], including a search for 
orthologs, a sequence check with checking_seq, filtering 
for longest sequence per species and gene, and excluding 
groups of orthologs with fewer than four species, the 
data set included a total of 13,573 sequences from 4,536 
species and 375 genes. Step [III], the exclusion of species 
with only one sequence in the data set, led to the exclu- 
sion of 1,074 species and subsequently of 68 groups of 
orthologs. Accordingly, sequences of 3,462 species in 307 
groups of orthologs were aligned in step [IV] . The selec- 
tion of the first and second maximum cliques of species 
with an overlap of at least 100 nucleotides or amino acids 
[steps IX and X] and the subsequent tests for composi- 
tional heterogeneity [steps XI and XII] led to the exclu- 
sion of 669 species and reduced the data set to 2,793 
species. The pruning of species-rich genera to 15 species 
led to the exclusion of another 549 species [step XIV]. 
Pruned genera were Camponotus, Cardiocondyla, Dory- 
lus, Lasius, Myrmecocystus, Pheidole, Pogonomyrmex, 
Polyrhachis, Pseudomyrmex (Formicidae), Bombus, 
Diadasia, Euglossa, Xylocopa (Apidae), Colletes, Hylaeus 
(Colletidae), Aleiodes, Cotesia (Braconidae), Ceratosolen 
(Agaonidae), Andricus (Cynipidae), Neodiprion (Diprioni- 
dae), Pontania (Tenthredinidae), Megastigmus (Torymi- 
dae) and Polistes (Vespidae). 

After selecting the largest group of species that overlap 
in at least one group of orthologs [step XV], the final con- 
catenated data set (subset 1) included 1,146 species (46 
families), 222 groups of orthologs, 3,951 sequences and 
88,626 aligned sites. Data coverage in subset 1 (number of 
sequences number of groups of orthologs x number of 
species) was 1.55%. Tree reconstruction and 560 rapid 
bootstrap replicates took 8.3 days. Tree 1 obtained from 
subset 1 is shown in Figures 2 and 3 and Additional file 5. 

Subset 2 included an additional 115 sequences of 51 
species from 31 families. Overall, the concatenated subset 
2 consisted of 1,207 species (77 families), 222 groups of 
orthologs, 4,005 sequences and 88,807 aligned sites. The 
number of species is > 1,146 plus 51 due to repeated 
tests for compositional heterogeneity with slightly differ- 
ent results. (Both subsets are available at http://www. 
zfmk.de/web/Forschung/Molekularlabor/Datenstze/ 
index.en.html). Data coverage (number of sequences -r 
number of groups of orthologs x number of species) in 
subset 2 was 1.49%. Tree reconstruction and 512 rapid 
bootstrap replicates took 8.9 days. Tree 2 obtained from 
subset 2 is shown in Figure 4 and Additional file 6. All 
species and all groups of orthologs included in subsets 1 
and 2 are listed in Additional files 7, 8, 9 and 10. 
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Figure 2 Simplified phylogenetic tree of Hymenoptera inferred from GenBank sequences (tree 1 obtained from subset 1) The tree 
includes 1,142 species. The applied color code indicates major lineages. 



Discussion 

The aim of the present investigation was to develop a 
bioinformatics pipeline for retrieving, processing, filter- 
ing, editing and analyzing large amounts of sequence 
data from GenBank in a phylogenetic context. Instead of 
using supertree approaches to explore existing data (see, 
for example, [19,49]), we relied on a direct reanalysis of 
the sequence data. Smith et al. [50] presented an alterna- 
tive approach that they called a "mega-phylogeny 
approach", which also directly uses sequence data. It 



includes an a priori selection of gene regions of interest 
and an a priori separation of sequences into alleged 
monophyla with the aims of reducing the size of the 
supermatrix and improving alignment quality. A number 
of taxon-specific studies have also made use of GenBank 
sequence data, but those studies focused on specific 
genes (see, for example, [51,52]). We intended to avoid a 
priori decisions. In our pipeline, we suggest solutions for 
almost any obstacle that may appear along the way from 
sequence retrieval to tree reconstruction under the ML 
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Vespoidea 




Figure 3 Phylogenetic tree of Hymenoptera inferred from GenBank sequences (tree 1), reduced to family level. Numbers that follow the 
family names indicate the number of analyzed species. Numbers above branches indicate bootstrap support values (%). Values < 50% are 
omitted. The applied color code corresponds to that of Figure 2. Single species whose position in the inferred phylogenetic tree we consider 
erroneous are shown in gray. 



optimality criterion. In various regards, our approach is 
an extension and improvement of earlier efforts [2,4]. It 
offers an extended degree of automation in steps such as 
downloading from GenBank, sorting of sequences and 
translating and backtranslating sequences [steps I, b.II, b. 
V and VI] (Figure 1). Also, our approach includes 
improved quality management, such as by automatically 
checking the GenBank sequences for strand polarity and 
annotation, by masking problematic alignment regions 
and by handling compositional heterogeneity [steps b.III, 
VII and XI] (Figure 1). Our data selection steps [for 
example, steps III, IX and XV] (Figure 1) guarantee stan- 
dardized levels of the density of the data set and of 
sequence overlap between included species. By choosing 
a minimum sequence overlap of 100 positions, we 



attempted to find a reasonable compromise between 
sequence overlap and number of species in the analysis. 
A larger overlap would have led to a significant decrease 
of the number of species in our phylogenetic tree. 
Furthermore, the present study is an update in terms of 
tree reconstruction facilities. We have, for the first time, 
applied a ML algorithm to such a large amount of Gen- 
Bank data [step XVII] (Figure 1). Our approach is more 
general and independent of the taxonomic group. Finally, 
our bioinformatics solution is transparent and user- 
friendly. We provide all new scripts with respective com- 
ments and detailed manuals as part of this publication so 
that the pipeline is ready for use by anybody interested. 
In the following paragraphs, we discuss the results of our 
exemplary pipeline run with Hymenoptera data. 
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Formicidae 300 
Tiphiidae 1 " 
Muiillidao 1 
Plumariidae 2 * 
Vespidae- 22 
Vespidae 2 
Tiphiidae 1 " 
Vespidae 1 
Vespidae 1 
Vespidae 2 
Vespidae 7 
Beihylidae 1 
Sicrelornorphidae 1 * 
Chrysididae 34 
Apidae 136 
Megachilidae 7 
Colletidae 59 
Sienotriiidae 2 ' 
Halictidae 3 
Andrenidae 2 
Melittidae 2 ' 
Crabronidae 2 * 
Megalyridae 1 
Sphecidae 1 
Pompilidae 1 





Encyrtidae 1 * 
Maamingidae 1 * 
Mymarammatidae 1 * 
Braconidae 226 
Ichneumonidae 22 
Orussidae 2 * 
Proctotrupidae 1 * 
Vanhorniidae 1 
Slephanidae 2 
Xiphydriidao 1 
Siricidae 3 
Anaxyelidae 1 
Cephidae 4 
Pamphiliidae 5 




Tenthredinidae 49 
Cimbicidae 4 
Diprionidae 18 
Tenihredinidao 1 



Ceraphronoidea 



| Mymarommatoidea 
Ichneumonoidea 
Orussoidea 

| Stephanoidea 
Xiphydrioidea 
Siricoidea 
Cephoidea 
Pamphilioidea 
Xyeloidea 



'Tenthredinoidea' 



■ Pergidae 3 



Figure 4 Phylogenetic tree of Hymenoptera inferred from GenBank sequences (tree 2 obtained from subset 2), reduced to family 
level. In this tree, species that were excluded by our pipeline in the course of generating subset 1 are reincluded. These taxa are marked with 
asterisks. The meaning of numbers and the applied color code correspond to those in Figure 3. 



Data set and analysis 

One of the main characteristics of data sets when com- 
bining sequence data from independently conducted 
investigations is data scarcity; that is, the lack of data 
overlap. Data distribution in supermatrices is unbalanced, 



and, as a consequence, there is a huge amount of missing 
data. However, data sets do not necessarily have to be 
complete to provide phylogenetic information. In fact, 
there is evidence that even with very low coverage, reli- 
able phylogenetic estimates can be obtained (see, for 
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example, [53]). The sheer proportion of missing data is 
not decisive as long as the number of characters scored is 
sufficient to correctly place the taxa in the tree [54]. 
Accordingly, we tried to cope with the problem of data 
scarcity by ensuring a minimum sequence overlap 
between taxa and a standardized data set density [steps 
III, IX, XIII, XIV and XV] (Figure 1). Still, our Hymenop- 
tera data matrix is very large and exhibits very low cover- 
age (1.5%). This is a direct consequence of the 
characteristics of the original sequence information pre- 
sent in GenBank. A large number of species for which 
only few sequences are available contrasts with a small 
number of species for which the transcriptome, the mito- 
chondrial genome or even the entire nuclear genome 
have been sequenced. By combining all of these data in a 
single analysis, this data set will inevitably become large 
and unbalanced and will suffer from low overlap between 
taxa. Irrespective of the fact that sequencing is getting 
cheaper and faster and that phylogenomic data will 
rapidly increase the size of data sets, the data characteris- 
tics described herein are still expected to prevail in the 
near future. The challenge is to find optimal subsets for 
phylogenetic analysis in order to explore available infor- 
mation and to subsequently identify and fill the most 
severe gaps via target-specific sequencing. Accordingly, 
one of the goals of our approach has been to identify 
unstable nodes and to suggest future foci of molecular 
phylogenetic studies, in Hymenoptera, for an effective, 
economical and time-saving process. 

For tree reconstruction, we performed supermatrix ML 
analyses. To the best of our knowledge, this is the largest 
set of eukaryotic real data studied using ML analysis. Past 
studies that utilized very large data sets applied supertrees 
or parsimony analyses. For example, McMahon and 
Sanderson [2] and Thomson and Shaffer [4] applied maxi- 
mum parsimony analyses with supermatrices in their pipe- 
lines, but stated that they based this decision mainly on 
speed and computational capacity. However, with the lat- 
est program version of RAxML implementing partitioned 
analysis, rapid bootstrap functions, and the ability of paral- 
lel analyses, even very large data sets, can be analyzed in a 
reasonable amount of time. In the next few years, systema- 
tic biologists' access to multicore computers will get easier 
and broader, and high-performance computing (HPC) will 
become routine. At the moment, subsets should be con- 
strained in size to allow ML analysis. During our work, we 
set an approximate maximum of 1,500 taxa and 100,000 
sites. Phylogenetic analyses of subsets of this size take a 
maximum of two weeks on a fully parallelized HPC unit 
such as the one that we used. Unless one wants to analyze 
data sets that are significantly larger than ours, there is no 
computational or speed argument left to perform super- 
tree or parsimony methods in favor of ML analyses. 
Accordingly, our approach was designed to prepare data 



for ML analysis. However, if a user wants to apply other 
algorithms for tree reconstruction (for example, maximum 
parsimony) or to adjust parameters (for example, to seek 
an extension of exploration of tree space or a comparison 
between inferred trees), the supermatrix produced by 
our pipeline can be used just as well (after step XVI) 
(Figure 1). 

The phylogeny of Hymenoptera 

We have restricted our results and discussion to (1) new 
contributions to the phylogeny of major lineages within 
Apocrita and to the monophyly and phylogeny of Procto- 
trupomorpha, (2) the recovery of some noncontroversial 
relationships and (3) the diagnosis of persistent problems 
and possible solutions. Phylogenetic relations within 
Hymenoptera are far too numerous and complex to be 
exhaustively discussed. The complete trees in Additional 
files 5 and 6 can be consulted for lower systematic level 
relationships. 

In the following subsections, we repeatedly refer to sin- 
gle species as "misplaced". This means that their position 
as inferred in our trees clearly contradicts previous results 
from taxonomic as well as morphological and molecular 
phylogenetic studies. Accordingly, the phylogenetic posi- 
tions of these taxa were considered artefacts and were 
excluded from discussion of topologies. 

Major lineages within Apocrita 

Within Apocrita, our analysis suggests a topology of Ste- 
phanoidea + (Ichneumonoidea + (Proctotrupomorpha + 
(Evanioidea + Aculeata))) (with misplacement of a single 
Vanhorniidae as sister to Stephanoidea being ignored) 
(Figure 3). Stephanoidea was inferred to be sister group to 
all other Apocrita in the morphological analyses of 
Vilhelmsen et al. [17]. Our analysis gives additional support 
for this relationship. The Ichneumonoidea are monophy- 
letic in our trees. (Misplacement of a single Trigonalidae as 
sister to Braconidae is ignored.) Ichneumonoidea has been 
suggested as sister group to Aculeata by Rasnitsyn [13], a 
relationship that found only moderate support from Vil- 
helmsen et al. [17] and was not retrieved by most recent 
analyses (see, for example, [16,21,24,55,56]). Our trees cor- 
roborate the results of most analyses cited above and sug- 
gest a rejection of the clade Aculeata + Ichneumonoidea. 
Instead, we found Evanioidea to be sister group to Aculeata 
in our trees. A sister group relationship of Evanioidea and 
Aculeata has been suggested only by the combined mor- 
phological and molecular analysis by Sharkey et al. [57], 
and there are currently no convincing morphological syna- 
pomorphies that would support this clade. However, 
despite low branch support, we consider it quite possible 
that the Evanioidea are the long-sought sister group to 
the Aculeata and suggest further investigation of this parti- 
cular clade. Rasnitsyn [13] introduced the supertaxon 
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Evaniomorpha, which includes Evanioidea, Ceraphronoi- 
dea, Megalyroidea, Trigonaloidea and Stephanoidea. We 
argue against the monophyly of Evaniomorpha, as our data 
support Stephanoidea as sister taxon of the remaining 
Apocrita (corroborating Vilhelmsen et al. [17]). We cannot 
provide substantial information on the position of the 
superfamilies Ceraphronoidea, Megalyroidea and Trigona- 
loidea, because their representatives are either included 
solely in the extended, possibly less reliable tree 2 (Ceraph- 
ronoidea) or obviously misplaced (Megalyroidea and 
Trigonaloidea). 

Proctotrupomorpha 

In our analyses, Proctotrupomorpha s.l. (that is, sensu 
Rasnitsyn 1988 [13]) was retrieved when again ignoring a 
few misplaced taxa. In tree 1, Proctotrupomorpha com- 
prises Chalcidoidea, Platygastroidea and Cynipoidea (all 
of which are monophyletic, forming Cynipoidea + (Platy- 
gastroidea + Chalcidoidea)). In tree 2, more representa- 
tives of Proctotrupomorpha s.l. are present, and the 
inferred topology suggests the following relationships: 
Cynipoidea + (Platygastroidea + (Mymarommatoidea + 
(Diaprioidea + Chalcidoidea))). This contradicts the often 
proposed sister group relationship between Mymarom- 
matoidea and Chalcidoidea (see, for example, [24,57,58]; 
but see the ambiguity in [17]). A sister group relationship 
between Diaprioidea and Chalcidoidea was retrieved in 
the molecular analyses of Castro and Dowton [56], but 
their taxon sampling lacked Mymarommatoidea, and was 
retrieved by Heraty et al. [21]. Our study is one of the 
first to include Mymarommatoidea in a molecular phylo- 
genetic analysis, but the position of Mymarommatoidea 
in our analysis is not well supported and the group is 
represented only in the less reliable tree 2. A position of 
Chalcidoidea outside Proctotrupomorpha was recently 
proposed by Sharanowski et al. [20] based on the analysis 
of 24 putative orthologous genes (derived from ESTs) 
from a small number of taxa. We regard this position as 
unlikely based on our own results and those of previous 
molecular studies that provided respective parts of our 
data set [16,21,56]. The most recent morphological or 
combined morphological and molecular analyses also 
contradict an origin of Chalcidoidea outside Proctotrupo- 
morpha [17,57]. 

Recovery of noncontroversial relationships 

We evaluated the reliability of the inferred phylogenetic 
trees by the recovery of phylogenetic relationships that are 
largely considered noncontroversial. We found positive 
indications in tree 1. Specifically, our results are consistent 
with the generally accepted paraphyly of "Symphyta" (see, 
for example, [24]) and with the generally accepted mono- 
phyly of Apocrita and Aculeata (see, for example, [24,28]) 
(with misplacement of one Megalyridae within Aculeata 



being ignored). Also, we retrieved the noncontroversially 
monophyletic superfamilies Apoidea, Chalcidoidea, Cyni- 
poidea, Evanioidea, Ichneumonoidea and Siricoidea. How- 
ever, some crucial taxa were not represented in tree 1: 
Xyelidae and Orussidae. If we add them to the data set to 
infer tree 2, they are misplaced. The Xyelidae are found as 
a sister group to Pamphilioidea (Figure 4). This position is 
not very likely, as the sister group relationship of Xyelidae 
and the remaining Hymenoptera is well supported 
[25-27]. The Orussidae, which have a key position within 
Hymenoptera evolution as sister group of Apocrita, are 
placed at the base of Apocrita along with some Proctotru- 
poidea taxa (Figure 4). However, the clade Orussidae + 
Apocrita is well established and supported by morphologi- 
cal and molecular data (see, for example, [13,17,18,57]). 
This demonstrates the necessity of sequence overlap defi- 
nitions and shows that the positions of reincluded taxa 
(indicated by asterisks in Figure 4 and Additional file 6) 
have to be discussed with caution. The backbone of the 
tree, with its major splits, however, remains largely unaf- 
fected by adding taxa that do not fulfill our overlap 
criteria. 

Diagnosis of persistent problems and possible solutions 

With the aid of our trees, we identified several persistent 
problems in the Hymenoptera tree. While the available 
sequence data already cover all major lineages of Hyme- 
noptera, they are unequally distributed and there is poor 
overlap among taxa. This contradiction between taxo- 
nomic breadth and genomic depth in the data of Hyme- 
noptera is in accordance with the conclusions of 
Sanderson [59] in his evaluation of the phylogenetic signal 
in Eukaryota. The large amount of missing data and the 
low taxonomic overlap between mitochondrial and 
nuclear data in our sets call for a solution. To get more 
independent markers and to close the taxonomic gap 
between mitochondrial and nuclear data, we suggest EST 
studies (nuclear genes) for taxa with completely sequenced 
mitochondrial genomes and sequencing of mitochondrial 
genomes of those taxa for which we already have a large 
number of nuclear sequence data available. 

An obvious problem for solving higher-level relation- 
ships within Hymenoptera is the underrepresentation of 
the small superfamilies Megalyroidea, Trigonaloidea, 
Ceraphronoidea and Mymarommatoidea. Another highly 
problematic issue is those families of Proctotrupoidea 
that we currently cannot map on the phylogenetic tree. 
Any additional data regarding these taxa in terms of 
species and genes will be of great value. 

As extensive EST studies are still expensive, we also 
recommend target-specific amplification of nuclear coding 
genes. With the prospect of new primer design tools 
(J. Borner, C. Pick, T. Burmester, unpublished data), 
amplification and sequencing of a data set of, for example, 
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22 taxa (all superfamilies) and 50 nuclear coding genes can 
be accomplished in a reasonable amount of time and at 
reasonable cost. Taxon sampling should again be based on 
taxa with completely sequenced mitochondrial genomes. 

Conclusions 

Exemplarily for Hymenoptera, we have demonstrated 
that the tree reconstructed from our pipeline output can 
make a substantial contribution to the phylogeny of the 
taxon and that comprehensive results can complement 
the discrete inferences from the single studies that have 
produced the data that were reanalyzed. Inspired by 
McMahon and Sanderson [2] and Sanderson et al. [3], 
we found an adequate approach to analyze all currently 
available molecular data in a single phylogenetic study in 
a standardized and efficient way. The impossible child of 
the scientific community, a sequence data monster, can 
be tamed. Every systematic biologist, even without 
advanced programming and bioinformatics skills, is given 
the capability to produce a tree of his taxon of interest. 
Our approach offers the possibility of relatively simple 
and reliable monitoring of new data and tree robustness, 
that is, the possibility to keep track of the phylogenetic 
signal in a taxonomic group. This also enables research- 
ers to monitor how phylogenetic trees change over time 
with an increase of data size and density. This might pro- 
mote a better understanding of more theoretical issues 
related to the analyses of molecular data, such as the 
information content of genes or the suitability and selec- 
tion of genes to answer phylogenetic questions. Our 
approach therefore has great potential to meet the chal- 
lenges of the phylogenomic era, to improve our ideas on 
phylogenetic affinities and to contribute to a better 
understanding of the evolution of organisms. 

Additional material 



numbers and the applied color code correspond to those in Additiona 
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